Trustworthy Experimentation Under Telemetry Loss
Failure to accurately measure the outcomes of an experiment can lead to bias
and incorrect conclusions. Online controlled experiments (a.k.a. A/B tests) are
increasingly being used to make decisions to improve websites as well as mobile
and desktop applications. We argue that loss of telemetry data (during upload
or post-processing) can skew the results of experiments, leading to loss of
statistical power and inaccurate or erroneous conclusions. By systematically
investigating the causes of telemetry loss, we argue that it is not practical
to entirely eliminate it. Consequently, experimentation systems need to be
robust to its effects. Furthermore, we note that it is nontrivial to measure
the absolute level of telemetry loss in an experimentation system. In this
paper, we take a top-down approach towards solving this problem. We motivate
the impact of loss qualitatively using experiments in real applications
deployed at scale, and formalize the problem by presenting a theoretical
breakdown of the bias introduced by loss. Based on this foundation, we present
a general framework for quantitatively evaluating the impact of telemetry loss,
and present two solutions to measure the absolute levels of loss. This
framework is used by well-known applications at Microsoft, with millions of
users and billions of sessions. These general principles can be adopted by any
application to improve the overall trustworthiness of experimentation and
data-driven decision making.
Comment: Proceedings of the 27th ACM International Conference on Information
and Knowledge Management, October 201
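The bias described above can be illustrated with a toy simulation (the metric, the numbers, and both loss models below are hypothetical, not taken from the paper): loss that is independent of the outcome mostly costs statistical power, while outcome-correlated loss skews the measured treatment effect itself.

```python
import random

random.seed(0)

def observed_mean(outcomes, loss_prob):
    """Mean of the outcomes whose telemetry survives; loss_prob(y) is the
    probability that the record for outcome y is lost before analysis."""
    kept = [y for y in outcomes if random.random() >= loss_prob(y)]
    return sum(kept) / len(kept)

# Hypothetical per-user engagement metric; treatment truly lifts the mean by 0.5.
control   = [random.gauss(10.0, 2.0) for _ in range(50_000)]
treatment = [random.gauss(10.5, 2.0) for _ in range(50_000)]

uniform_loss = lambda y: 0.2                      # loss independent of outcome
skewed_loss  = lambda y: 0.8 if y >= 11 else 0.0  # heavy sessions fail to upload

lift_uniform = (observed_mean(treatment, uniform_loss)
                - observed_mean(control, uniform_loss))
lift_skewed = (observed_mean(treatment, skewed_loss)
               - observed_mean(control, skewed_loss))
# lift_uniform stays near the true 0.5; lift_skewed is noticeably attenuated,
# because the treatment arm has more high-outcome sessions and so loses more data.
```

The skewed case shows why simply comparing surviving records across arms is not enough: the arms lose data at different effective rates, so the estimate is biased even with very large samples.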
Analysis of Problem Tokens to Rank Factors Impacting Quality in VoIP Applications
User-perceived quality-of-experience (QoE) in internet telephony systems is
commonly evaluated using subjective ratings computed as a Mean Opinion Score
(MOS). In such systems, while user MOS can be tracked on an ongoing basis, it
does not give insight into which factors of a call induced any perceived
degradation in QoE -- it does not tell us what caused a user to have a
sub-optimal experience. For effective planning of product improvements, we are
interested in understanding the impact of each of these degrading factors,
allowing the estimation of the return (i.e., the improvement in user QoE) for a
given investment. To obtain such insights, we advocate the use of an
end-of-call "problem token questionnaire" (PTQ) which probes the user about
common call quality issues (e.g., distorted audio or frozen video) which they
may have experienced. In this paper, we show the efficacy of this questionnaire
using over 700,000 end-of-call surveys gathered from Skype
(a large commercial VoIP application). We present a method to rank call quality
and reliability issues and address the challenge of isolating independent
factors impacting the QoE. Finally, we present representative examples of how
these problem tokens have proven to be useful in practice.
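A minimal sketch of the ranking idea (the token names and responses below are invented for illustration; the paper's method additionally isolates independent factors, which this sketch does not attempt):

```python
from collections import Counter

# Hypothetical end-of-call survey responses: each entry is the set of problem
# tokens the user checked, with an empty set meaning no problem was reported.
surveys = [
    {"distorted_audio"},
    {"frozen_video", "distorted_audio"},
    set(),
    {"frozen_video"},
    {"distorted_audio"},
    set(),
]

counts = Counter(tok for s in surveys for tok in s)
n = len(surveys)

# Rank tokens by the fraction of surveyed calls in which users reported them.
ranking = sorted(((c / n, tok) for tok, c in counts.items()), reverse=True)
# e.g. distorted_audio reported on 3 of 6 calls outranks frozen_video on 2 of 6.
```

Note that tokens can co-occur on the same call (as in the second survey above), which is exactly why disentangling independent factors is the harder part of the problem.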
Improving Meeting Inclusiveness using Speech Interruption Analysis
Meetings are a pervasive method of communication within all types of
companies and organizations, and using remote collaboration systems to conduct
meetings has increased dramatically since the COVID-19 pandemic. However, not
all meetings are inclusive, especially in terms of the participation rates
among attendees. In a recent large-scale survey conducted at Microsoft, the top
suggestion given by meeting participants for improving inclusiveness was to
improve the ability of remote participants to interrupt and acquire the floor
during meetings. We show that the use of the virtual raise hand (VRH) feature
can lead to an increase in predicted meeting inclusiveness at Microsoft. One
challenge is that VRH is used in less than 1% of all meetings. In order to
drive adoption of its usage to improve inclusiveness (and participation), we
present a machine learning-based system that predicts when a meeting
participant attempts to obtain the floor, but fails to interrupt (termed a
"failed interruption"). This prediction can be used to nudge the user to raise
their virtual hand within the meeting. We believe this is the first failed
speech interruption detector, and the performance on a realistic test set has
an area under the curve (AUC) of 0.95 with a true positive rate (TPR) of 50% at a
false positive rate (FPR) of <1%. To our knowledge, this is also the first
dataset of interruption categories (including the failed interruption category)
for remote meetings. Finally, we believe this is the first such system designed
to improve meeting inclusiveness through speech interruption analysis and
active intervention.
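The operating point quoted above (TPR at a bounded FPR) can be read off from classifier scores by sweeping a decision threshold. The sketch below uses invented toy scores, not the paper's data or model:

```python
def tpr_at_fpr(scores_pos, scores_neg, max_fpr=0.01):
    """Highest TPR achievable at any threshold whose FPR stays within max_fpr."""
    best_tpr = 0.0
    for t in sorted(set(scores_pos) | set(scores_neg)):
        # Predict positive when score >= t; measure both error rates.
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        if fpr <= max_fpr:
            best_tpr = max(best_tpr, tpr)
    return best_tpr

# Toy scores: positives are failed interruptions, negatives are everything else.
pos = [0.95, 0.9, 0.8, 0.4]
neg = [0.5, 0.3, 0.2, 0.1] * 25  # 100 negative examples

tpr = tpr_at_fpr(pos, neg, max_fpr=0.01)  # here: 0.75 at zero false positives
```

Reporting TPR at a tight FPR budget is the natural metric here, since nudging users on false alarms would quickly erode trust in the feature.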
GALEX-SDSS Catalogs for Statistical Studies
We present a detailed study of the Galaxy Evolution Explorer's photometric
catalogs with special focus on the statistical properties of the All-sky and
Medium Imaging Surveys. We introduce the concept of primaries to resolve the
issue of multiple detections and follow a geometric approach to define clean
catalogs with well-understood selection functions. We cross-identify the GALEX
sources (GR2+3) with Sloan Digital Sky Survey (DR6) observations, which
indirectly provides invaluable insight into the astrometric model of the UV
sources and allows us to revise the band merging strategy. We derive the formal
description of the GALEX footprints as well as their intersections with the
SDSS coverage along with analytic calculations of their areal coverage. The
crossmatch catalogs are made available for the public. We conclude by
illustrating the implementation of typical selection criteria in SQL for
catalog subsets geared toward statistical analyses, e.g., correlation and
luminosity function studies.
Comment: 12 pages, 15 figures, accepted to Ap
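A minimal sketch of positional cross-identification as described above (the matching radius, source IDs, and coordinates are illustrative; the paper's actual procedure, including primary resolution, is more involved):

```python
import math

def ang_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation of two sky positions (degrees in, arcseconds out),
    via the haversine formula, which is stable for small separations."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600

def crossmatch(galex, sdss, radius_arcsec=5.0):
    """For each GALEX source, the nearest SDSS source within the matching
    radius. A brute-force O(N*M) sketch; real catalogs need a spatial index."""
    matches = {}
    for gid, (gra, gdec) in galex.items():
        best = min(((ang_sep_arcsec(gra, gdec, sra, sdec), sid)
                    for sid, (sra, sdec) in sdss.items()), default=None)
        if best is not None and best[0] <= radius_arcsec:
            matches[gid] = best[1]
    return matches
```

For example, a GALEX source at (150.0, 2.0) degrees would match an SDSS source offset by about 1.5 arcseconds while ignoring one a full degree away.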